Non-autoregressive Translation with Layer-Wise Prediction and Deep Supervision

Authors

Abstract

How do we perform efficient inference while retaining high translation quality? Existing neural machine translation models, such as the Transformer, achieve high performance, but they decode words one by one, which is inefficient. Recent non-autoregressive translation models speed up the inference, but their quality is still inferior. In this work, we propose DSLP, a highly efficient and high-performance model for translation. The key insight is to train a non-autoregressive Transformer with Deep Supervision and feed additional Layer-wise Predictions. We conducted extensive experiments on four translation tasks (both directions of WMT'14 EN-DE and WMT'16 EN-RO). Results show that our approach consistently improves BLEU scores compared with the respective base models. Specifically, our best variant outperforms the autoregressive model on three translation tasks, while being 14.8 times more efficient in inference.
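The mechanism named in the abstract, layer-wise prediction combined with deep supervision, can be illustrated with a short sketch. The code below is a minimal, hypothetical simplification rather than the paper's implementation: self-attention-only layers stand in for the non-autoregressive decoder (cross-attention to the encoder is omitted), and each layer's prediction is fused into the next layer by concatenation plus a linear map, which may differ from the paper's exact fusion and loss weighting.

```python
# Minimal sketch of layer-wise prediction with deep supervision (DSLP-style).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSLPDecoderSketch(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, n_layers=6, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
            for _ in range(n_layers)
        ])
        self.out_proj = nn.Linear(d_model, vocab_size)   # shared prediction head
        self.fuse = nn.Linear(2 * d_model, d_model)      # mixes a layer's prediction back in

    def forward(self, dec_input_ids, target_ids=None):
        h = self.embed(dec_input_ids)                    # (batch, length, d_model)
        total_loss, logits = 0.0, None
        for i, layer in enumerate(self.layers):
            h = layer(h)                                 # no causal mask: all positions in parallel
            logits = self.out_proj(h)                    # layer-wise prediction
            if target_ids is not None:                   # deep supervision: loss at every layer
                total_loss = total_loss + F.cross_entropy(
                    logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))
            if i < len(self.layers) - 1:                 # feed this layer's prediction forward
                pred_emb = self.embed(logits.argmax(dim=-1))
                h = self.fuse(torch.cat([h, pred_emb], dim=-1))
        loss = total_loss / len(self.layers) if target_ids is not None else None
        return logits, loss

# One forward pass predicts every target position at once, which is where the
# inference speed-up over token-by-token autoregressive decoding comes from.
model = DSLPDecoderSketch()
dec_inputs = torch.randint(0, 1000, (2, 8))   # placeholder decoder inputs
targets = torch.randint(0, 1000, (2, 8))
final_logits, loss = model(dec_inputs, targets)
```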

Similar Articles

Non-Autoregressive Neural Machine Translation

Existing approaches to neural machine translation condition each output word on previously generated outputs. We introduce a model that avoids this autoregressive property and produces its outputs in parallel, allowing an order of magnitude lower latency during inference. Through knowledge distillation, the use of input token fertilities as a latent variable, and policy gradient fine-tuning, we...
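As a rough illustration of the contrast this abstract draws, the sketch below compares token-by-token autoregressive decoding with single-pass parallel decoding. Here `model` is a stand-in for any network mapping decoder inputs to per-position vocabulary logits; the paper's fertility latent variable, knowledge distillation, and policy gradient fine-tuning are omitted.

```python
import torch

def autoregressive_decode(model, bos_id, length):
    # One forward pass per token: position t is conditioned on all tokens before it.
    out = torch.full((1, 1), bos_id, dtype=torch.long)
    for _ in range(length):
        logits = model(out)                              # (1, t, vocab)
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        out = torch.cat([out, next_tok], dim=1)
    return out[:, 1:]

def non_autoregressive_decode(model, dec_inputs):
    # Single forward pass: every position is predicted in parallel, giving the
    # order-of-magnitude latency reduction the abstract refers to.
    logits = model(dec_inputs)                           # (1, length, vocab)
    return logits.argmax(dim=-1)
```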

Layer-wise learning of deep generative models

When using deep, multi-layered architectures to build generative models of data, it is difficult to train all layers at once. We propose a layer-wise training procedure admitting a performance guarantee compared to the global optimum. It is based on an optimistic proxy of future performance, the best latent marginal. We interpret autoencoders in this setting as generative models, by showing tha...
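Since the snippet is cut off, the sketch below only illustrates the general idea of layer-wise training that the abstract refers to: each layer is trained on top of the already-trained, frozen layers below it. Plain autoencoder reconstruction is used as a stand-in per-layer objective; the paper's actual criterion (the "best latent marginal" proxy) is not implemented here.

```python
import torch
import torch.nn as nn

def train_layerwise(data, layer_sizes, epochs=5, lr=1e-3):
    """Greedily train one autoencoder layer at a time on the codes produced by the
    previously trained (frozen) layers. Reconstruction loss is an illustrative
    stand-in, not the best-latent-marginal criterion from the paper."""
    trained_encoders = []
    inputs = data                                        # (n_samples, dim)
    for in_dim, out_dim in zip(layer_sizes[:-1], layer_sizes[1:]):
        enc, dec = nn.Linear(in_dim, out_dim), nn.Linear(out_dim, in_dim)
        opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            code = torch.sigmoid(enc(inputs))
            loss = nn.functional.mse_loss(dec(code), inputs)
            loss.backward()
            opt.step()
        trained_encoders.append(enc)
        with torch.no_grad():                            # freeze: the next layer sees fixed codes
            inputs = torch.sigmoid(enc(inputs))
    return trained_encoders

encoders = train_layerwise(torch.randn(64, 32), layer_sizes=[32, 16, 8])
```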

Greedy Layer-Wise Training of Deep Networks

Complexity theory of circuits strongly suggests that deep architectures can be much more efficient (sometimes exponentially) than shallow architectures, in terms of computational elements required to represent some functions. Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until re...

Layer-wise analysis of deep networks with Gaussian kernels

Deep networks can potentially express a learning problem more efficiently than local learning machines. While deep networks outperform local learning machines on some problems, it is still unclear how their nice representation emerges from their complex structure. We present an analysis based on Gaussian kernels that measures how the representation of the learning problem evolves layer after la...
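As a rough sketch of what a layer-wise kernel analysis can look like, the code below builds a Gaussian (RBF) kernel matrix from each layer's activations and tracks a simple alignment score against the labels. The specific quantity studied in the paper may differ, so treat the alignment measure as an illustrative stand-in.

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    # Gaussian kernel matrix over the rows of X (one row per sample).
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def kernel_alignment(K, labels):
    # Similarity between the data kernel and the ideal label kernel (1 where labels match).
    Y = np.equal.outer(labels, labels).astype(float)
    return (K * Y).sum() / (np.linalg.norm(K) * np.linalg.norm(Y))

def layerwise_kernel_analysis(activations_per_layer, labels, sigma=1.0):
    # activations_per_layer: one (n_samples, n_features) array per layer.
    return [kernel_alignment(rbf_kernel(A, sigma), labels) for A in activations_per_layer]

# Example: the score typically changes layer by layer as the representation adapts to the task.
acts = [np.random.randn(50, 20) for _ in range(4)]
labels = np.random.randint(0, 2, size=50)
print(layerwise_kernel_analysis(acts, labels))
```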

Journal

Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2022

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v36i10.21323